JSM 2025: Estimating the impact of ambient air pollution on overall mortality in the presence of preferential sampling and measurement error using electronic health record data

Authors
Affiliation

Tae Yoon Lee

Quantitative Sciences Unit, Stanford University School of Medicine

Chloe C. Su

Quantitative Sciences Unit, Stanford University School of Medicine

Summer S. Han

Quantitative Sciences Unit, Stanford University School of Medicine

Published

August 2, 2025

Abstract

A network of monitoring sites is often not well-designed for accurately mapping ambient (outdoor) air pollution due to external factors, such as budget constraints and public opinions. As such, naively using point measurements from the monitoring network can lead to biased mapping. This can have profound downstream implications for environmental health studies when this mapping is used to estimate ambient air pollution exposure at participants’ locations. In this talk, we will address this potential bias due to preferential sampling in the design of a monitoring network for estimating the ambient air pollution field in California. We will utilize a recently developed spatio-temporal statistical framework that simultaneously models the air pollution field and monitoring site selection process. Further, we will examine the downstream implications in quantifying the effects of ambient air pollution on 5-year overall mortality among patients diagnosed with lung cancer using electric health record data from Stanford Health Care, an academic medical center. We will employ a Bayesian model to incorporate the measurement error in the air pollution exposure.

Part II: Outdoor Air Pollution Modeling

This part requires computation. The R version and platform used to generate the results are shown at the end.

Import data and packages

library(INLA)
Loading required package: Matrix
This is INLA_25.06.07 built 2025-06-11 19:15:03 UTC.
 - See www.r-inla.org/contact-us for how to get help.
 - List available models/likelihoods/etc with inla.list.models()
 - Use inla.doc(<NAME>) to access documentation
library(inlabru)
Loading required package: fmesher
library(sf)
Linking to GEOS 3.13.0, GDAL 3.5.3, PROJ 9.5.1; sf_use_s2() is TRUE
library(sp)
library(reshape2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
library(tibble)
library(tidyr)

Attaching package: 'tidyr'
The following object is masked from 'package:reshape2':

    smiths
The following objects are masked from 'package:Matrix':

    expand, pack, unpack
library(stringr)

df_border <- read_rds("../data/df_border_scaled.rds")
df_border_unscaled <- st_union(read_rds("../data/ca_border.rds"))
df_pm25_annual <- read_rds("../data/df_pm25_annual_scaled.rds") %>% 
  group_by(site_number) %>% 
  mutate(n=n()) %>% 
  ungroup() %>% 
  # remove if the site was active only for one year
  filter(n>2) %>% 
  select(-n) %>% 
  mutate(site_number=as.factor(as.numeric(as.factor(site_number))))
xlabel <- expression(paste("Annual average PM 2.5 (",mu, "g", "/", m^3, ")",sep=""))

Data Preparation for Modeling

To prepare the data for modeling, we perform the following steps:

  1. Calculate the annual exposure of PM 2.5 for each monitoring site.

  2. Descale the annual exposure by dividing it the overall mean across the years and sites.

  3. Use the logarithm of the descaled annual exposure.

  4. For each site and year, generate the presence indicator \(R\) and its lag to incorporate the retention effect.

  5. Convert year into the 0-1 scale and rename it as time and create an auxiliary variable zero.

pm_annual_mean <- df_pm25_annual %>% 
  group_by(year) %>% 
  summarise(average = mean(annual_mean))

average_pm_annual <- mean(pm_annual_mean$average)
write_rds(average_pm_annual,"../results/average_pm_annual.rds")
df_pm25_annual$log_annual_mean <- log(df_pm25_annual$annual_mean) - log(average_pm_annual)

num_sites <- length(unique(df_pm25_annual$site_number))
num_timepoints <- length(unique(df_pm25_annual$year))
range_year <- range(df_pm25_annual$year)

df_pm25_flat <- dcast(df_pm25_annual, site_number + north + east ~ year, value.var = "log_annual_mean")
df_pm25  <- melt(df_pm25_flat, id.vars = c(1,2,3), variable.name = 'year', value.name = 'log_annual_mean')
df_pm25$year <- as.numeric(as.character(factor(df_pm25$year, labels = range_year[1]:range_year[2])))

ggplot(data=df_pm25 %>% 
         mutate(presence = !is.na(log_annual_mean),
                site_number=as.factor(site_number)),
       aes(y=site_number,x=year,fill=presence)) +
  geom_tile() +
  ylab("Monitoring site number") +
  xlab("Year") +
  scale_fill_manual(values = c("grey90","black"),name="Present?") +
  theme_classic(base_size=20) +
  theme(legend.position = "top")

df_var_annual <- df_pm25 %>% 
  group_by(year) %>% 
  summarise(var_pm = var(log_annual_mean, na.rm = T))

# rescale time
df_pm25 <- df_pm25 %>% 
  mutate(time = (year - range_year[1])/(range_year[2]-range_year[1]),
         site_number = as.numeric(as.factor(site_number)))
df_pm25$locs <- coordinates(df_pm25[, c("east", "north")])
df_pm25 <- df_pm25 %>% 
  select(-east,-north)

# Site-selection indicator
df_pm25 <- df_pm25 %>% 
  mutate(R = as.numeric(!is.na(log_annual_mean)))

df_pm25$R_lag <- c(rep(NA, num_sites), df_pm25$R[1:(dim(df_pm25)[1]-num_sites)])

# Response variable for the auxiliary model
df_pm25$zero <- 0

Building the Mesh for INLA

Fitting a Bayesian spatio-temporal model can be extremely computionally demanding. Integrated Nested Laplace Approximation utilizes the Stochastic Partial Differentiation Equation (SPDE) to estimate the spatial autocorrelation much more efficiently based on a mesh - a network of discrete sampling locations which are interpolated to approximate a continuous process (e.g., air pollution field) in space. For more details, please see the INLA tutorials.

The size of the mesh plays a crucial role. If it is too big, the fitting process will be too crude. If it too small, the fitting process could be too computationally demanding, and it might cause convergence issues in our study, as there aren’t too many observations. We choose the following mesh by trial-and-error:

edge_in <- 0.5
edge_out <- 2 * edge_in
mesh <- fm_mesh_2d_inla(loc = cbind(df_pm25$east, df_pm25$north),
                       boundary = df_border,
                       offset = c(2*edge_in, edge_out), 
                       max.edge = 2*c(edge_in, edge_out),
                       cutoff = 1*edge_in,
                       min.angle = 21)

ggplot(df_pm25_flat) + 
  gg(mesh) + 
  geom_point(aes(x = east, y = north), color = "blue",size=5) + 
  xlab("East / 10km") +
  ylab("North / 10km") +
  coord_fixed() +
  theme_minimal(base_size=20)

# Maybe we should consider set the PC prior using data info
spde_obj <- inla.spde2.pcmatern(mesh = mesh, 
                                alpha = 2, 
                                prior.range = c(2.5*edge_in, 0.01),
                                prior.sigma = c(1.5*sqrt(mean(df_var_annual$var_pm)), 0.1),
                                constr = T)

Generating Pseudo-Sites

We add pseudo-sites (i.e., locations that are not monitored) to all vertices in the mesh and create an expanded dataset in order to capture the impact of PS on pseudo-sites. For details, see Watson, Zidek, and Shaddick (2019).

# expanded dataset
df_pm25_flat <- dcast(df_pm25_annual, site_number + north + east ~ year, value.var = "log_annual_mean")
utm_crs_km <- st_crs("+proj=utm +zone=10 +ellps=WGS84 +units=km")
nodes_locs <- data.frame("N" = mesh$loc[, 2]*10, "E" = mesh$loc[, 1]*10) %>%
  st_as_sf(coords = c("E", "N"), crs = utm_crs_km)
idx_in <- st_contains(df_border_unscaled, nodes_locs) %>% 
  as.matrix() %>% 
  c()

sites_locs <- data.frame(east = df_pm25_flat$east,
                         north = df_pm25_flat$north)
nodes_in_locs <- data.frame(east = mesh$loc[idx_in, 1], 
                            north = mesh$loc[idx_in, 2])
pseudo_sites <- setdiff(nodes_in_locs, 
                        sites_locs) 

df_pm25_flat_pseudo <- data.frame(matrix(data=NA, 
                                         nrow=dim(pseudo_sites)[1], 
                                         ncol=dim(df_pm25_flat)[2])) %>%
  `colnames<-`(colnames(df_pm25_flat)) %>% 
  mutate(north = pseudo_sites$north,
         east = pseudo_sites$east)

df_pm25_flat_expand <- df_pm25_flat %>% 
  rbind(df_pm25_flat_pseudo)

# Create site number for all pseudo and real sites
df_pm25_flat_expand$site_number <- 1:nrow(df_pm25_flat_expand)

ggplot(df_pm25_flat_expand) + gg(mesh) + 
  geom_point(aes(x = east, y = north), 
             color = c(rep("blue", nrow(df_pm25_flat)), 
                       rep("green", nrow(df_pm25_flat_pseudo))),
             size = c(rep(5,nrow(df_pm25_flat)),
                      rep(2, nrow(df_pm25_flat_pseudo))),
             alpha=c(rep(1,nrow(df_pm25_flat)),
                      rep(0.5, nrow(df_pm25_flat_pseudo)))) + 
  xlab("East / 10km") +
  ylab("North / 10km") + 
  coord_fixed() +
  theme_minimal(base_size=20)

# Transform it into long format 
df_pm25_expand <- melt(df_pm25_flat_expand, id.vars = c(1,2,3), variable.name = 'year', value.name = 'log_annual_mean')
no_sites_expand <- nrow(df_pm25_flat_expand)
stopifnot(nrow(df_pm25_expand) == no_sites_expand * num_timepoints)

df_pm25_expand <- df_pm25_expand %>% 
  mutate(time = as.numeric(year),
         time = (time-min(time))/(max(time)-min(time)))
df_pm25_expand$locs <- coordinates(df_pm25_expand[, c("east", "north")])
df_pm25_expand <- select(df_pm25_expand, !c("east", "north"))
df_pm25_expand$site_number <- as.numeric(as.factor(df_pm25_expand$site_number))

# Site-selection indicator
df_pm25_expand <- df_pm25_expand %>% 
  mutate(R=as.numeric(!is.na(log_annual_mean)),
         R_lag=c(rep(NA, no_sites_expand),
                 R[1:(nrow(df_pm25_expand)-no_sites_expand)]),
         # the auxiliary variable
         zero = 0)

Model Fitting

We will fit three models:

  • Fitting the site process and air pollution field independently under the assumption of no PS.

  • Fitting the site process and air pollution field jointly to detect the PS.

  • Fitting the site process and air pollution field jointly on the expanded pseudo-site dataset to detect the PS AND estimate the air pollution field that accounts for potential effect of PS.

Fitting the site process and the air pollution field independently

Please see the slides for details. It takes less than <1 min to fit the models independently (which may vary across R versions and platforms).

comp_joint_indep <- ~ 
  # Components for observation model
  Intercept_obs(1) +
  Time_obs_1(time) +
  Time_obs_2(time^2) +
  Random_obs_0(site_number, model = "iid2d", n = num_sites*2, constr=TRUE) +
  Random_obs_1(site_number, weights = time, copy = "Random_obs_0") +
  Spatial_obs_0(locs, model = spde_obj) +
  Spatial_obs_1(locs, weights = time, model = spde_obj) +
  Spatial_obs_2(locs, weights = time^2, model = spde_obj) +

  # Components for site selection model
  Intercept_slc(1) + 
  Time_slc_1(time) +
  Time_slc_2(time^2) +  
  R_lag_slc(R_lag) +
  AR_slc(year, model='ar1', hyper=list(theta1=list(prior="pcprec",param=c(2, 0.01)))) +
  Spatial_slc(locs, model = spde_obj)

like_obs <- bru_obs(
  formula = log_annual_mean ~ Intercept_obs + Time_obs_1 + Time_obs_2 +
    Random_obs_0 + Random_obs_1 +
    Spatial_obs_0 + Spatial_obs_1 + Spatial_obs_2,
  family = "gaussian",
  data = df_pm25
)

like_slc <- bru_obs(
  formula = R ~ Intercept_slc + Time_slc_1 + Time_slc_2 +
    R_lag_slc + AR_slc + Spatial_slc,
  family = "binomial",
  Ntrials = rep(1, times = length(df_pm25$R)),
  data = df_pm25
)

bru_options_reset()
bru_options_set(bru_max_iter = 10,
                control.inla = list(strategy = "gaussian",
                                    int.strategy = 'eb'),
                control.predictor=list(link=1),
                bru_verbose = 3)

start.time <- Sys.time()
fit_bru_indep <- bru(comp_joint_indep, like_obs, like_slc)
end.time <- Sys.time()
time.taken.independent <- end.time - start.time
time.taken.independent # 1 min
write_rds(fit_bru_indep,"../results/bru_indep.rds")

For the joint models, we will exclude the spatial fields that evolve over time due to substantial uncertainty in their estimates: Spatial_obs_1 and Spatial_obs_2.

fit_bru_indep <- read_rds("../results/bru_indep.rds")
summary(fit_bru_indep)
inlabru version: 2.13.0
INLA version: 25.06.07
Components:
Intercept_obs: main = linear(1), group = exchangeable(1L), replicate = iid(1L), NULL
Time_obs_1: main = linear(time), group = exchangeable(1L), replicate = iid(1L), NULL
Time_obs_2: main = linear(time^2), group = exchangeable(1L), replicate = iid(1L), NULL
Random_obs_0: main = iid2d(site_number), group = exchangeable(1L), replicate = iid(1L), NULL
Random_obs_1(=Random_obs_0): main = unknown(site_number), group = exchangeable(1L), replicate = iid(1L), weights = time
Spatial_obs_0: main = spde(locs), group = exchangeable(1L), replicate = iid(1L), NULL
Spatial_obs_1: main = spde(locs), group = exchangeable(1L), replicate = iid(1L), weights = time
Spatial_obs_2: main = spde(locs), group = exchangeable(1L), replicate = iid(1L), weights = time^2
Intercept_slc: main = linear(1), group = exchangeable(1L), replicate = iid(1L), NULL
Time_slc_1: main = linear(time), group = exchangeable(1L), replicate = iid(1L), NULL
Time_slc_2: main = linear(time^2), group = exchangeable(1L), replicate = iid(1L), NULL
R_lag_slc: main = linear(R_lag), group = exchangeable(1L), replicate = iid(1L), NULL
AR_slc: main = ar1(year), group = exchangeable(1L), replicate = iid(1L), NULL
Spatial_slc: main = spde(locs), group = exchangeable(1L), replicate = iid(1L), NULL
Observation models:
  Family: 'gaussian'
    Tag: <No tag>
    Data class: 'data.frame'
    Response class: 'numeric'
    Predictor: 
        log_annual_mean ~ Intercept_obs + Time_obs_1 + Time_obs_2 + Random_obs_0 + 
            Random_obs_1 + Spatial_obs_0 + Spatial_obs_1 + Spatial_obs_2
    Additive/Linear: TRUE/TRUE
    Used components: effects[Intercept_obs, Time_obs_1, Time_obs_2, Random_obs_0, Random_obs_1, Spatial_obs_0, Spatial_obs_1, Spatial_obs_2], latent[]
  Family: 'binomial'
    Tag: <No tag>
    Data class: 'data.frame'
    Response class: 'numeric'
    Predictor: 
        R ~ Intercept_slc + Time_slc_1 + Time_slc_2 + R_lag_slc + AR_slc + 
            Spatial_slc
    Additive/Linear: TRUE/TRUE
    Used components: effects[Intercept_slc, Time_slc_1, Time_slc_2, R_lag_slc, AR_slc, Spatial_slc], latent[]
Time used:
    Pre = 5.33, Running = 28.4, Post = 0.142, Total = 33.9 
Fixed effects:
                mean    sd 0.025quant 0.5quant 0.975quant   mode kld
Intercept_obs  0.277 0.096      0.089    0.277      0.464  0.277   0
Time_obs_1    -1.541 0.147     -1.830   -1.541     -1.252 -1.541   0
Time_obs_2     1.035 0.130      0.781    1.035      1.289  1.035   0
Intercept_slc -2.540 0.606     -3.727   -2.540     -1.352 -2.540   0
Time_slc_1    -2.538 3.033     -8.483   -2.538      3.407 -2.538   0
Time_slc_2     3.606 3.153     -2.574    3.606      9.786  3.606   0
R_lag_slc      8.143 0.689      6.792    8.143      9.494  8.143   0

Random effects:
  Name    Model
    Random_obs_0 IID2D model
   Spatial_obs_0 SPDE2 model
   Spatial_obs_1 SPDE2 model
   Spatial_obs_2 SPDE2 model
   AR_slc AR1 model
   Spatial_slc SPDE2 model
   Random_obs_1 Copy

Model hyperparameters:
                                           mean     sd 0.025quant 0.5quant
Precision for the Gaussian observations  29.459  2.281     25.376   29.322
Precision for Random_obs_0 (component 1) 22.649  6.425     12.485   21.827
Precision for Random_obs_0 (component 2)  3.360  2.912      0.437    2.537
Rho1:2 for Random_obs_0                   0.000  0.208     -0.403    0.000
Range for Spatial_obs_0                  20.422 13.417      5.412   17.036
Stdev for Spatial_obs_0                   0.344  0.117      0.172    0.325
Range for Spatial_obs_1                  13.017 20.530      1.321    7.172
Stdev for Spatial_obs_1                   0.103  0.086      0.015    0.079
Range for Spatial_obs_2                  14.234 24.365      1.230    7.417
Stdev for Spatial_obs_2                   0.112  0.085      0.019    0.089
Precision for AR_slc                      1.244  0.859      0.318    1.020
Rho for AR_slc                           -0.450  0.347     -0.903   -0.528
Range for Spatial_slc                    11.180 15.231      1.271    6.716
Stdev for Spatial_slc                     0.269  0.273      0.025    0.187
                                         0.975quant   mode
Precision for the Gaussian observations      34.349 28.938
Precision for Random_obs_0 (component 1)     37.563 20.296
Precision for Random_obs_0 (component 2)     11.054  1.228
Rho1:2 for Random_obs_0                       0.405 -0.001
Range for Spatial_obs_0                      55.635 11.975
Stdev for Spatial_obs_0                       0.628  0.289
Range for Spatial_obs_1                      62.094  3.027
Stdev for Spatial_obs_1                       0.331  0.041
Range for Spatial_obs_2                      70.854  2.885
Stdev for Spatial_obs_2                       0.335  0.052
Precision for AR_slc                          3.513  0.700
Rho for AR_slc                                0.396 -0.729
Range for Spatial_slc                        49.303  3.011
Stdev for Spatial_slc                         0.994  0.071

Marginal log-Likelihood:  -97.46 
 is computed 
Posterior summaries for the linear predictor and the fitted values are computed
(Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')

Joint modeling of the air pollution map and site process

It takes less than <2 min to fit the models jointly (which may vary across R versions and platforms).

inla.setOption(fmesher.timeout=120)
inla.setOption(inla.timeout=600)
inla.setOption(safe=T)

comp_aux <- ~ Intercept_obs(1) + # Components for observation model
  Time_obs_1(time) +
  Time_obs_2(time^2) +
  Random_obs_0(site_number, model = "iid2d", n = num_sites*2, constr=TRUE) +
  Random_obs_1(site_number, weights = time, copy = "Random_obs_0") +
  Spatial_obs_0(locs, model = spde_obj) +
  # Spatial_obs_1(locs, weights = time, model = spde_obj) +
  # Spatial_obs_2(locs, weights = time^2, model = spde_obj) +
  
  # Components for site selection model
  Intercept_slc(1) + 
  Time_slc_1(time) +
  Time_slc_2(time^2) +
  R_lag_slc(R_lag) +
  AR_slc(year, model='ar1', hyper=list(theta1=list(prior="pcprec",param=c(2, 0.01)))) +
  Spatial_slc(locs, model = spde_obj) +
  Comp_share1(site_number, copy = 'Comp_aux1', fixed = FALSE) +
  Comp_share2(site_number, copy = 'Comp_aux2', fixed = FALSE) +
  
  # Components for 1st auxiliary model
  Random_aux1_0(site_number, copy = "Random_obs_0", fixed = TRUE) +
  Random_aux1_1(site_number, weights = time, copy = "Random_obs_1", 
                fixed = TRUE) +
  Comp_aux1(site_number, model = 'iid', hyper = list(prec = list(initial = -20, fixed=TRUE))) +
  
  # Components for 2nd auxiliary model
  Spatial_aux2_0(locs, copy = "Spatial_obs_0", fixed = TRUE) +
  # Spatial_aux2_1(locs, weights = time, copy = "Spatial_obs_1", fixed = TRUE) +
  # Spatial_aux2_2(locs, weights = time^2, copy = "Spatial_obs_2", fixed = TRUE) +
  Comp_aux2(site_number, model = 'iid', hyper = list(prec = list(initial = -20, fixed=TRUE)))


# All likelihoods
like_obs <- bru_obs(
  formula = log_annual_mean ~ Intercept_obs + Time_obs_1 + Time_obs_2 + 
    Random_obs_0 + Random_obs_1 + 
    Spatial_obs_0, #+ Spatial_obs_1, #+ Spatial_obs_2,
  family = "gaussian",
  data = df_pm25
)

like_slc_share <- bru_obs(
  formula = R ~ Intercept_slc + Time_slc_1 + Time_slc_2 + 
    R_lag_slc + 
    AR_slc +
    Spatial_slc +
    Comp_share1 +
    Comp_share2,
  family = "binomial",
  Ntrials = rep(1, times = length(df_pm25$R)),
  data = df_pm25
)

like_aux_1 <- bru_obs(
  formula = zero ~ Random_aux1_0 + Random_aux1_1 +  Comp_aux1,
  family = "gaussian",
  data = df_pm25
)

like_aux_2 <- bru_obs(
  formula = zero ~ Spatial_aux2_0 +#  Spatial_aux2_1 +  #+ Spatial_aux2_2 +
    Comp_aux2,
  family = "gaussian",
  data = df_pm25
)

bru_options_reset()
bru_options_set(bru_max_iter = 20,
                control.inla = list(strategy = "auto", 
                                    int.strategy = 'auto',
                                    verbose=2),
                control.predictor=list(link=1),
                control.family = list(
                  list(),
                  list(),
                  list(hyper = list(prec = list(initial = 30, fixed=T))),
                  list(hyper = list(prec = list(initial = 30, fixed=T)))),
                bru_verbose = 3)
start_time_joint <- Sys.time()
fit_bru_joint <- bru(comp_aux, like_obs, like_slc_share, like_aux_1, like_aux_2)
end_time_joint <- Sys.time()
runtime_joint <- end_time_joint - start_time_joint
runtime_joint
write_rds(fit_bru_joint,"../results/bru_joint.rds")

Joint modeling of the air pollution map and site process on the expanded dataset

It takes less than <5 min to fit the models jointly (which may vary across R versions and platforms).

inla.setOption(fmesher.timeout=120)
inla.setOption(inla.timeout=1000)
inla.setOption(safe=T)
comp_aux_expand <- ~ 
  # Components for observation model
  Intercept_obs(1) + 
  Time_obs_1(time) +
  Time_obs_2(time^2) +
  Random_obs_0(site_number, model = "iid2d", n = no_sites_expand*2, constr=TRUE) +
  Random_obs_1(site_number, weights = time, copy = "Random_obs_0") +
  Spatial_obs_0(locs, model = spde_obj) +
  
  # Components for site selection model
  Intercept_slc(1) + Time_slc_1(time) +
  Time_slc_2(time^2) +
  R_lag_slc(R_lag) +
  AR_slc(year, model='ar1', hyper=list(theta1=list(prior="pcprec",param=c(2, 0.01)))) +
  Spatial_slc(locs, model = spde_obj) +
  Comp_share1(site_number, copy = 'Comp_aux1', fixed = FALSE) + 
  Comp_share2(site_number, copy = 'Comp_aux2', fixed = FALSE) +
  
  # Components for 1st auxiliary model
  Random_aux1_0(site_number, copy = "Random_obs_0", fixed = TRUE) +
  Random_aux1_1(site_number, weights = time, copy = "Random_obs_1", fixed = TRUE) +
  Comp_aux1(site_number, model = 'iid', hyper = list(prec = list(initial = -20, fixed=TRUE))) +
  
  # Components for 2nd auxiliary model
  Spatial_aux2_0(locs, copy = "Spatial_obs_0", fixed = TRUE) +
  Comp_aux2(site_number, model = 'iid', hyper = list(prec = list(initial = -20, fixed=TRUE)))


# All likelihoods
like_obs <- bru_obs(
  formula = log_annual_mean ~ Intercept_obs + Time_obs_1 + Time_obs_2 + 
    Random_obs_0 + Random_obs_1 + 
    Spatial_obs_0, 
  family = "gaussian",
  data = df_pm25_expand
)

like_slc_share <- bru_obs(
  formula = R ~ Intercept_slc + Time_slc_1 + Time_slc_2 + 
    R_lag_slc + 
    AR_slc +
    Spatial_slc +
    Comp_share1 + Comp_share2,
  family = "binomial",
  Ntrials = rep(1, times = length(df_pm25_expand$R)),
  data = df_pm25_expand
)

like_aux_1 <- bru_obs(
  formula = zero ~ Random_aux1_0 + Random_aux1_1 + Comp_aux1,
  family = "gaussian",
  data = df_pm25_expand
)

like_aux_2 <- bru_obs(
  formula = zero ~ Spatial_aux2_0 + 
    Comp_aux2,
  family = "gaussian",
  data = df_pm25_expand
)

bru_options_reset()
bru_options_set(bru_max_iter = 20,
                control.inla = list(strategy = "auto", 
                                    int.strategy = 'auto'),
                control.predictor=list(link=1),
                control.family = list(
                  list(),
                  list(),
                  list(hyper = list(prec = list(initial = 30, fixed=TRUE))),
                  list(hyper = list(prec = list(initial = 30, fixed=TRUE)))),
                bru_verbose = 3)

start_time_expand <- Sys.time()
fit_bru_joint_expand <- bru(comp_aux_expand, like_obs, 
                          like_slc_share, like_aux_1, like_aux_2)
end_time_expand <- Sys.time()
runtime_expand <- start_time_expand - end_time_expand
runtime_expand
write_rds(fit_bru_joint_expand,"../results/bru_joint_expand.rds")

Model parameters

We extract the key parameters from all the models fitted.

fit_bru_indep <- read_rds("../results/bru_indep.rds")            # the joint model on the original dataset
fit_bru_joint <- read_rds("../results/bru_joint.rds")            # the joint model on the original dataset
fit_bru_joint_expand <- read_rds("../results/bru_joint_expand.rds")  # the joint model on the expanded dataset
average_pm_annual <- read_rds("../results/average_pm_annual.rds")

chosen_row_names <- c("Intercept_obs","Time_obs_1","Time_obs_2",
                      "Rho1:2 for Random_obs_0",
                      "Range for Spatial_obs_0",
                      "Stdev for Spatial_obs_0",
                      "Intercept_slc","Time_slc_1","Time_slc_2",
                      "R_lag_slc",
                      "Rho for AR_slc",
                      "Precision for AR_slc",
                      "Range for Spatial_slc",
                      "Stdev for Spatial_slc",
                      "Beta for Comp_share1",
                      "Beta for Comp_share2")

row_names_labels <- c("gamma_0","gamma_1","gamma_2",
                      "rho_Z",
                      "range_obs","std_obs",
                      "alpha_0","alpha_1","alpha_2",
                      "alpha_{retention}",
                      "rho_R","1/sigma^2_R",
                      "range_slc", "std_slc",
                      "d_b","d_\beta")

df_param_naive <- rbind(fit_bru_indep$summary.fixed %>% 
  select(mean,sd) %>% 
  tibble::rownames_to_column(),
fit_bru_indep$summary.hyperpar %>% 
  select(mean,sd) %>% 
  tibble::rownames_to_column()) %>% 
  mutate(type="naive")

df_param_expand <- rbind(fit_bru_joint_expand$summary.fixed %>% 
  select(mean,sd) %>% 
  rownames_to_column(),
fit_bru_joint_expand$summary.hyperpar %>% 
  select(mean,sd) %>% 
  rownames_to_column()) %>% 
  mutate(type="pseudo-site")

df_param_joint <- rbind(fit_bru_joint$summary.fixed %>% 
  select(mean,sd) %>% 
  rownames_to_column(),
fit_bru_joint$summary.hyperpar %>% 
  select(mean,sd) %>% 
  rownames_to_column()) %>% 
  mutate(type="presence only")

df_param <- rbind(df_param_naive,
                  df_param_expand,
                  df_param_joint)

df_chosen_param <- df_param %>% 
  filter(rowname %in% chosen_row_names) %>% 
  mutate(mean = formatC(round(mean,2),format='f',digits=2),
         sd =  formatC(round(sd,2),format='f',digits=2),
         val = str_c(mean, " (",sd,")")) %>% 
  select(-mean,-sd) %>% 
  pivot_wider(names_from=type,values_from=val)

chosen_row_names_obs <- c("Intercept_obs","Time_obs_1","Time_obs_2",
                      "Rho1:2 for Random_obs_0",
                      "Range for Spatial_obs_0",
                      "Stdev for Spatial_obs_0")

chosen_row_names_slc <- c("Intercept_slc","Time_slc_1","Time_slc_2",
                      "R_lag_slc",
                      "Rho for AR_slc",
                      "Precision for AR_slc",
                      "Range for Spatial_slc",
                      "Stdev for Spatial_slc",
                      "Beta for Comp_share1",
                      "Beta for Comp_share2")

row_names_labels_obs <- c("gamma_0","gamma_1","gamma_2",
                      "rho_Z",
                      "range_obs","std_obs")


row_names_labels_slc <- c("alpha_0","alpha_1","alpha_2",
                      "alpha_{retention}",
                      "rho_R","1/sigma^2_R",
                      "range_slc", "std_slc",
                      "d_b","d_\beta")

df_obs <- df_chosen_param %>% 
  filter(rowname %in% chosen_row_names_obs) %>% 
  mutate(rowname=factor(rowname,levels=chosen_row_names_obs,
                        labels = row_names_labels_obs)) %>% 
  select(rowname,naive,`presence only`,`pseudo-site`)

df_slc <- df_chosen_param %>% 
    filter(rowname %in% chosen_row_names_slc) %>% 
  mutate(rowname=factor(rowname,levels=chosen_row_names_slc,
                        labels = row_names_labels_slc)) %>% 
  select(rowname,naive,`presence only`,`pseudo-site`)

df_obs
# A tibble: 6 × 4
  rowname   naive         `presence only` `pseudo-site`
  <fct>     <chr>         <chr>           <chr>        
1 gamma_0   0.28 (0.10)   0.27 (0.07)     0.28 (0.06)  
2 gamma_1   -1.54 (0.15)  -1.53 (0.13)    -1.53 (0.14) 
3 gamma_2   1.03 (0.13)   1.03 (0.12)     1.03 (0.12)  
4 rho_Z     0.00 (0.21)   -0.02 (0.11)    -0.00 (0.02) 
5 range_obs 20.42 (13.42) 15.99 (3.50)    12.34 (0.70) 
6 std_obs   0.34 (0.12)   0.29 (0.05)     0.24 (0.01)  
df_slc
# A tibble: 10 × 4
   rowname             naive         `presence only` `pseudo-site`
   <fct>               <chr>         <chr>           <chr>        
 1 "alpha_0"           -2.54 (0.61)  -2.30 (0.78)    -5.75 (0.45) 
 2 "alpha_1"           -2.54 (3.03)  -2.72 (3.73)    -5.21 (2.34) 
 3 "alpha_2"           3.61 (3.15)   3.60 (3.82)     3.75 (2.42)  
 4 "alpha_{retention}" 8.14 (0.69)   7.88 (0.67)     11.94 (0.64) 
 5 "1/sigma^2_R"       1.24 (0.86)   1.26 (0.28)     14.43 (0.78) 
 6 "rho_R"             -0.45 (0.35)  0.24 (0.11)     0.82 (0.01)  
 7 "range_slc"         11.18 (15.23) 4.72 (1.33)     5.78 (0.41)  
 8 "std_slc"           0.27 (0.27)   0.12 (0.03)     0.68 (0.03)  
 9 "d_b"               <NA>          0.96 (0.26)     0.97 (0.06)  
10 "d_\beta"           <NA>          0.81 (0.18)     0.15 (0.05)  
write_csv(df_obs,"../results/obs_param_table.csv")
write_csv(df_slc,"../results/slc_param_table.csv")

Estimated air pollution field

We first estimate the outdoor air pollution field for PM2.5 using the pseudo-site locations.

set.seed(2025)
pred_bru_naive <- generate(fit_bru_indep, 
                     df_pm25_expand, 
                     ~ exp(Intercept_obs + Time_obs_1 + Time_obs_2 + 
                             Random_obs_0 + Random_obs_1 + 
                           Spatial_obs_0),
                     n.samples = 1000) %>%
  as.data.frame() %>%
  `*`(average_pm_annual)

df_pred_naive <- df_pm25_expand %>% 
  select(year,locs,R)

df_pred_naive$ann_mean <- apply(pred_bru_naive,1,mean)
df_pred_naive$ann_sd <- apply(pred_bru_naive,1,sd)
df_pred_naive$q_low <- apply(pred_bru_naive,1,function(x){quantile(x,0.025)})
df_pred_naive$q_high <- apply(pred_bru_naive,1,function(x){quantile(x,0.975)})
df_pred_naive$east <- df_pred_naive$locs[,1]
df_pred_naive$north <- df_pred_naive$locs[,2]

df_naive <- st_as_sf(df_pred_naive,coords = c("east","north"))

`pred_bru_adjusted <- generate(fit_bru_joint_expand, 
                     df_pm25_expand, 
                     ~ exp(Intercept_obs + Time_obs_1 + Time_obs_2 + 
                             Random_obs_0 + Random_obs_1 + 
                           Spatial_obs_0),
                     n.samples = 1000) %>%
  as.data.frame() %>%
  `*`(average_pm_annual)

df_pred_adjusted <- df_pm25_expand %>% 
  select(year,locs,R)

df_pred_adjusted$ann_mean <- apply(pred_bru_adjusted,1,mean)
df_pred_adjusted$ann_sd <- apply(pred_bru_adjusted,1,sd)
df_pred_adjusted$q_low <- apply(pred_bru_adjusted,1,function(x){quantile(x,0.025)})
df_pred_adjusted$q_high <- apply(pred_bru_adjusted,1,function(x){quantile(x,0.975)})
df_pred_adjusted$east <- df_pred_adjusted$locs[,1]
df_pred_adjusted$north <- df_pred_adjusted$locs[,2]

df_adjusted <- st_as_sf(df_pred_adjusted,coords = c("east","north"))

pred_bru_diff <- pred_bru_naive - pred_bru_adjusted

df_pred_diff <- df_pm25_expand %>% 
  select(year,locs,R)

df_pred_diff$ann_mean <- apply(pred_bru_diff,1,mean)
df_pred_diff$ann_sd <- apply(pred_bru_diff,1,sd)
df_pred_diff$q_low <- apply(pred_bru_diff,1,function(x){quantile(x,0.025)})
df_pred_diff$q_high <- apply(pred_bru_diff,1,function(x){quantile(x,0.975)})
df_pred_diff$east <- df_pred_diff$locs[,1]
df_pred_diff$north <- df_pred_diff$locs[,2]

df_diff <- st_as_sf(df_pred_diff,coords = c("east","north"))

df_all <- rbind(df_naive %>% 
                  mutate(type='no PS adjustment'),
                df_adjusted %>% 
                  mutate(type='PS adjustment'),
                df_diff %>% 
                  mutate(type='Difference'))
write_rds(df_all,"../results/pred_field.rds")
df_all <- read_rds("../results/pred_field.rds")
chosen_years <- c(2000,2010,2021)
ggplot(df_pm25_annual %>% 
         filter(year %in% chosen_years)) + 
  geom_sf(data=df_all %>% 
            filter(year%in% chosen_years) %>% 
            mutate(type = factor(type,
                                    levels = c("no PS adjustment",
                                               "PS adjustment",
                                               "Difference"),
                                    labels=c("Kriging",
                                             "Adjusted for PS",
                                             "Difference"))) %>% 
            filter(type!="Difference"),
          aes(color=ann_mean),size=3,alpha=0.8) +
  theme_minimal(base_size=20) +
  geom_point(aes(x=east,y=north,fill=annual_mean),size=5,pch=21,col='white') +
  facet_grid(type~year,switch='y')+
  scale_color_gradientn(limits=c(0,30),colors=c("white","red",'darkred'),name = xlabel) +
    scale_fill_gradientn(limits=c(0,30),colors=c("white","red",'darkred'),guide="none") +
  xlab("") +
  ylab("") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        legend.position = 'bottom',
        legend.key.width = unit(1.5,'cm')) 
Warning: Combining variables of class <integer> and <factor> was deprecated in ggplot2
3.4.0.
ℹ Please ensure your variables are compatible before plotting (location:
  `combine_vars()`)
Warning: Combining variables of class <factor> and <integer> was deprecated in ggplot2
3.4.0.
ℹ Please ensure your variables are compatible before plotting (location:
  `combine_vars()`)

ggplot(df_pm25_annual %>% 
         filter(year %in% chosen_years)) + 
  geom_sf(data=df_all %>% 
            filter(year%in% chosen_years) %>% 
            mutate(type = factor(type,
                                    levels = c("no PS adjustment",
                                               "PS adjustment",
                                               "Difference"),
                                    labels=c("Kriging",
                                             "Adjusted for PS",
                                             "Difference"))) %>% 
            filter(type!="Difference") %>% 
            mutate(cov = ann_sd/ann_mean * 100),
          aes(color=ann_sd),size=3,alpha=0.95) +
  theme_minimal(base_size=20) +
  facet_grid(type~year,switch='y')+
    scale_color_gradientn(limits=c(0,45),colors=c('white','red','darkred'),
                          name = xlabel) +
  xlab("") +
  ylab("") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        legend.position = 'bottom',
        legend.key.width = unit(1.5,'cm')) 

ggplot(df_pm25_annual %>% 
         filter(year %in% chosen_years)) + 
  geom_sf(data=df_all %>% 
            filter(year%in% chosen_years) %>% 
            mutate(type = factor(type,
                                    levels = c("no PS adjustment",
                                               "PS adjustment",
                                               "Difference"),
                                    labels=c("Kriging",
                                             "Adjusted for PS",
                                             "Difference"))) %>% 
            filter(type=="Difference"),
          aes(color=ann_mean),size=4,alpha=1) +
  theme_minimal(base_size=20) +
  facet_grid(.~year,switch='y')+
  scale_color_gradientn(limits=c(-2,30),colors=c("white","red",'darkred'),
                        name = xlabel) +
  xlab("") +
  ylab("") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        legend.position = 'bottom',
        legend.key.width = unit(1.5,'cm')) 

Distance nearest to the monitoring site

We calculate the distance to the nearest monitoring site for each patient at baseline (i.e., one year prior to the year of diagnosis of lung cancer).

df_patients <- read_rds("../data/df_SHC_chosen.rds")
df_pt_coord <- st_as_sf(df_patients, coords = c("longitude", "latitude"), 
                        crs = 4326) %>%
  st_transform(crs=utm_crs_km)
df_within <- st_within(df_pt_coord %>% select(geometry),df_border_unscaled)
within_logical <- sapply(df_within, function(x) length(x) > 0)
df_patients <- df_patients[within_logical,]
df_pt_coord <- df_pt_coord[within_logical,]
  
df_monitor_coord <- st_as_sf(df_pm25_annual %>% 
                               mutate(east= east*10,
                                      north = north*10),
                             coords = c("east","north"),
                             crs = utm_crs_km)
vec_min_distance <- rep(NA,nrow(df_pt_coord))
for(i in 1:nrow(df_pt_coord)){

  tmp_year <- df_pt_coord$year[i] - 1
  vec_min_distance[i] <- st_distance(df_pt_coord[i,],
                                     df_monitor_coord %>%
                filter(year == tmp_year)) %>%
    as.vector() %>%
    min()
}

vec_min_distance <- vec_min_distance %>%
  as.data.frame()
colnames(vec_min_distance) <- "distance"
write_rds(vec_min_distance,"../results/vec_min_distance.rds")
vec_min_distance <- read_rds("../results/vec_min_distance.rds")
ggplot(data=vec_min_distance,aes(x=distance)) +
  geom_histogram(linewidth=2,binwidth=5) +
  xlab("Distance to the nearest monitor (km)") +
  ylab("Number of patients") +
  scale_x_continuous(breaks=seq(0,50,by=10)) +
  theme_minimal(base_size=35)

summary(vec_min_distance$distance)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.07413  5.31238  9.29516 10.60386 14.30765 53.35964 
quantile(vec_min_distance$distance,.9)
     90% 
20.02748 

Prediction at patient location

We predict the annual average exposure to PM2.5 for each patient in our cohort at baseline.

df_SHC <- read_rds("../data/df_SHC_chosen.rds")

df_baseline_pollution <- df_SHC %>% 
  select(id,year,latitude,longitude) %>% 
  st_as_sf(coords = c("longitude", "latitude"), 
                        crs = 4326) %>%
  st_transform(crs=utm_crs_km)
df_coords <- st_coordinates(df_baseline_pollution)
df_baseline_pollution$east <- df_coords[,1]/10
df_baseline_pollution$north <- df_coords[,2]/10
df_baseline_pollution <- df_baseline_pollution %>% 
  st_drop_geometry()
df_baseline_pollution$locs <- coordinates(df_baseline_pollution[, c("east", "north")])
df_pred_baseline_pollution <- df_baseline_pollution %>% 
  mutate(R=0,
         R_lag=0,
         zero=0,
         log_annual_mean=0,
         year=year-1,
         time = (year-1999) / (2021- 1999),
         site_number = NA)

df_pt_naive <- generate(fit_bru_joint, 
                     df_pred_baseline_pollution, 
                     ~ exp(Intercept_obs + Time_obs_1 + Time_obs_2 + 
                             Random_obs_0 + Random_obs_1 + 
                           Spatial_obs_0),
                     n.samples = 1000) %>%
  as.data.frame() %>%
  `*`(average_pm_annual)

df_pt_adjusted <- generate(fit_bru_joint_expand, 
                     df_pred_baseline_pollution %>% 
                       mutate(site_number=NA), 
                     ~ exp(Intercept_obs + Time_obs_1 + Time_obs_2 + 
                             Random_obs_0 + Random_obs_1 + 
                           Spatial_obs_0),
                     n.samples = 1000) %>%monitor site number
  as.data.frame() %>%
  `*`(average_pm_annual)

df_SHC$pred_pm_mean_naive <- apply(df_pt_naive,1,mean)
df_SHC$pred_pm_sd_naive <- apply(df_pt_naive,1,sd)
df_SHC$pred_pm_mean_adjusted <- apply(df_pt_adjusted,1,mean)
df_SHC$pred_pm_sd_adjusted <- apply(df_pt_adjusted,1,sd)

df_SHC$pred_pm_mean_naive %>% summary()
df_SHC$pred_pm_mean_adjusted %>% summary()
df_SHC$pred_pm_diff <- df_SHC$pred_pm_mean_naive - df_SHC$pred_pm_mean_adjusted 
df_SHC$min_distance <- vec_min_distance$distance


df_SHC <- df_SHC %>% 
  mutate(grp = Hmisc::cut2(min_distance,cuts=c(seq(0,40,by=5),55)))
saveRDS(df_SHC,"../results/df_SHC_pm25.rds")
df_SHC <- read_rds("../results/df_SHC_pm25.rds")

ggplot(data=df_SHC,
       aes(x=grp,y=pred_pm_diff)) +
  geom_boxplot(outlier.shape=NA) +
  ylab(expression(paste("Difference in annual average PM 2.5 (", mu, "g", "/", m^3, ")", sep = ""))) +
  xlab("Distance to the nearest monitoring site (km)") +
  theme_classic(base_size=20) +
  geom_hline(yintercept = 0,col='darkgrey',linewidth=2,alpha=0.8)+
  ylim(-0.25,0.25)
Warning: Removed 28 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Session info

sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] stringr_1.5.1  tidyr_1.3.1    tibble_3.3.0   ggplot2_3.5.2  readr_2.1.5   
 [6] dplyr_1.1.4    reshape2_1.4.4 sp_2.2-0       sf_1.0-21      inlabru_2.13.0
[11] fmesher_0.5.0  INLA_25.06.07  Matrix_1.7-3  

loaded via a namespace (and not attached):
 [1] utf8_1.2.6         generics_0.1.4     class_7.3-23       KernSmooth_2.23-26
 [5] stringi_1.8.7      lattice_0.22-7     hms_1.1.3          digest_0.6.37     
 [9] magrittr_2.0.3     RColorBrewer_1.1-3 evaluate_1.0.4     grid_4.5.1        
[13] fastmap_1.2.0      plyr_1.8.9         jsonlite_2.0.0     e1071_1.7-16      
[17] DBI_1.2.3          purrr_1.1.0        scales_1.4.0       cli_3.6.5         
[21] crayon_1.5.3       rlang_1.1.6        units_0.8-7        bit64_4.6.0-1     
[25] splines_4.5.1      withr_3.0.2        yaml_2.3.10        parallel_4.5.1    
[29] tools_4.5.1        tzdb_0.5.0         vctrs_0.6.5        R6_2.6.1          
[33] proxy_0.4-27       lifecycle_1.0.4    classInt_0.4-11    bit_4.6.0         
[37] htmlwidgets_1.6.4  vroom_1.6.5        pkgconfig_2.0.3    pillar_1.11.0     
[41] gtable_0.3.6       glue_1.8.0         Rcpp_1.1.0         xfun_0.52         
[45] tidyselect_1.2.1   dichromat_2.0-0.1  rstudioapi_0.17.1  knitr_1.50        
[49] farver_2.1.2       htmltools_0.5.8.1  labeling_0.4.3     rmarkdown_2.29    
[53] compiler_4.5.1    

References

Watson, Joe, James V Zidek, and Gavin Shaddick. 2019. “A General Theory for Preferential Sampling in Environmental Networks.”